Nutch

Apache Nutch

Developer(s)	Apache Software Foundation
Stable release	1.4 / December 26, 2011
Development status	Active
Written in	Java
Operating system	Cross-platform
Type	Search Engine
License	Apache License 2.0
Website	nutch.apache.org

Nutch is an open source web search engine project that uses Lucene Java for its search and index components.

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
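The plug-in idea can be illustrated with a minimal sketch. The names `MediaTypeParser` and `ParserRegistry` are hypothetical and are not part of the Nutch API; the sketch only assumes that each plug-in handles one media type and that a registry selects the plug-in at run time.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical extension point: one parser plug-in per media type.
interface MediaTypeParser {
    String parse(byte[] content);
}

// Hypothetical registry that maps a media type to the plug-in handling it.
class ParserRegistry {
    private final Map<String, MediaTypeParser> parsers = new HashMap<>();

    // Register a plug-in; media types are compared case-insensitively.
    void register(String mediaType, MediaTypeParser parser) {
        parsers.put(mediaType.toLowerCase(Locale.ROOT), parser);
    }

    // Parse the content with the matching plug-in, or return empty if none is registered.
    Optional<String> parse(String mediaType, byte[] content) {
        MediaTypeParser parser = parsers.get(mediaType.toLowerCase(Locale.ROOT));
        return parser == null ? Optional.empty() : Optional.of(parser.parse(content));
    }
}
```

In this sketch, support for a new media type is added by registering another `MediaTypeParser`, without changing the code that calls the registry.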

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.^[1]

Advantages ^[2]

Some of the advantages of Nutch, when compared to a simple fetcher:

highly scalable and relatively feature-rich crawler
politeness features, such as obeying robots.txt rules
robust and scalable - Nutch can be run on a cluster of 100 machines
quality - the crawl can be biased to fetch "important" pages first
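
Two of the points above, politeness and biased crawling, can be sketched as a small crawl frontier. This is an illustrative sketch rather than Nutch's implementation: the class `CrawlFrontier` and its methods are hypothetical, robots.txt handling is reduced to plain `Disallow` path prefixes (no user-agent groups, `Allow` rules, or wildcards), and the importance score is assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical frontier: skips paths disallowed by robots.txt and hands out
// the highest-scored path first.
class CrawlFrontier {
    // A queued path with an importance score (higher is fetched sooner).
    static final class Candidate {
        final String path;
        final double score;
        Candidate(String path, double score) { this.path = path; this.score = score; }
    }

    private final List<String> disallowedPrefixes = new ArrayList<>();
    private final PriorityQueue<Candidate> queue =
        new PriorityQueue<>(Comparator.comparingDouble((Candidate c) -> c.score).reversed());

    // Record a "Disallow: <prefix>" rule; an empty prefix forbids nothing and is ignored.
    void disallow(String prefix) {
        if (!prefix.isEmpty()) disallowedPrefixes.add(prefix);
    }

    // A path is allowed when it starts with none of the recorded prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    // Queue a path unless a rule forbids it; returns whether it was queued.
    boolean offer(String path, double score) {
        if (!isAllowed(path)) return false;
        queue.add(new Candidate(path, score));
        return true;
    }

    // Next path to fetch, or null when the frontier is empty.
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.path;
    }
}
```

With a rule such as `disallow("/private/")`, a path under `/private/` is never queued, and the remaining paths are returned in descending order of score.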

Scalability

IBM Research studied the performance^[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.^[4] The study found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.

The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.^[5]

Related projects

Hadoop - Java framework that supports distributed applications running on large clusters
NutchWAX - Uses Nutch to search a web archive
Sixearch - An unstructured peer network application, which provides a complementary way for users to actively and collaboratively share their own document collections.

Search engines built with Nutch

Creative Commons Search - launched 2004, Nutch implementation replaced 2006^[6]^[7]^[8]
DiscoverEd - Open educational resources search prototype developed by Creative Commons^[9]
Krugle
mozDex
Wikia Search - launched 2008, closed down 2009^[10]^[11]
search2.net
Tothego.com

References

Bibliography

Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. pp. 350. ISBN 978-1590596876. http://www.apress.com/book/view/9781590596876.

External links

Official website
Official wiki
Building Nutch: Open Source Search (2004) - ACM Queue vol. 2, no. 2
An article about Nutch (2003) - Search Engine Watch
Another article about Nutch (2003) - Tech News World
Official page of the Hadoop project
